Virtual machines recoverable from uncorrectable memory errors

ABSTRACT

The disclosed technology provides techniques, systems, and apparatus for containing and recovering from uncorrectable memory errors in a distributed computing environment. An aspect of the disclosed technology includes a hypervisor or virtual machine manager that receives signaling of an uncorrectable memory error detected by a host machine. The virtual machine manager then uses information received via the signaling to identify virtual memory addresses or memory pages associated with the corrupted memory element so as to allow for containment and recovery from the error.

BACKGROUND

Cloud computing has impacted the way in which enterprises manage computing needs. It provides reliability, flexibility, scalability, and redundancy in a cost-effective way. It enables an enterprise to manage its information technology needs without, for instance, traditional capital investment and maintenance considerations. As more and more computing shifts to cloud systems, these systems now store, process, and output data on a scale that years ago was likely unimaginable. An effect of this shift is that memory errors that occur in the cloud, if not contained and/or recovered from, can impact customer or user experience on a scale corresponding to an enterprise's footprint on the cloud. For instance, it is not atypical that detection of an uncorrectable memory error on a host leads to shutting down of the host, resulting in abrupt termination of all hosted virtual machines (VMs). With memory sizes on the gigabyte or terabyte scale, that may impact thousands of VMs or applications that require lengthy time periods to reestablish.

BRIEF SUMMARY

Aspects of the disclosed technology may comprise methods or systems implemented in a cloud computing environment that allow for containment (e.g., protecting DMA accesses from corrupted data) and recovery from uncorrectable memory errors.

Aspects of the disclosed technology may comprise a method. For instance, the method may be a method for uncorrectable memory error recovery in a cloud computing environment. The method may comprise: receiving, at a virtual machine managing manager, signaling of an uncorrectable memory error detected by a host machine; determining, at the virtual machine managing manager, one or more virtual machines associated with a corrupted memory element based on the received signaling; emulating a memory error associated with the corrupted memory element based on the uncorrectable memory error; and introducing, by the virtual machine managing manager, the emulated memory error into an operating environment of at least one of the one or more virtual machines.

Additional aspects of the method may comprise introducing the emulated memory error, comprising the virtual machine managing manager injecting an interrupt that is accepted by a virtual central processing unit (vCPU) of each of the one or more virtual machines. Further, the emulated memory error may comprise a notification that causes the at least one of the one or more virtual machines to signal the uncorrectable memory error to a guest user space. The emulated memory error may also comprise a notification that causes at least one of the one or more virtual machines to be restarted or terminated. The emulated memory error may comprise context information associated with the uncorrectable memory error including one or more of a location, a type, or a severity. A virtual machine managing manager may comprise a hypervisor.

In accordance with the method, signaling may comprise a BIOS of the host machine forwarding information associated with the uncorrectable memory error to an operating system of the host machine. Further still, the method may comprise the operating system of the host machine forwarding the information associated with the uncorrectable memory error to the virtual machine manager.

Additionally, in accordance with the method, introducing may comprise the virtual machine manager injecting the emulated memory error into a process of a virtual central processing unit of the at least one virtual machine. Further still, determining the one or more virtual machines associated with the corrupted memory element may comprise identifying at least one memory page associated with the corrupted memory element.

Aspects of the disclosed technology may also comprise a cloud computing system. The system may comprise a host machine capable of supporting one or more virtual machines, and one or more processing devices coupled to a memory containing instructions. The instructions may cause the one or more processors to: receive signaling from the host machine, the signaling indicating an uncorrectable memory error; determine, from among the one or more virtual machines, a virtual machine associated with a corrupted memory element based on the received signaling; and emulate a memory error associated with the corrupted memory element based on the uncorrectable memory error. The instructions may also cause the one or more processing devices to inject the emulated memory error into an operating environment of a virtual machine associated with the corrupted memory element.

The instructions may also cause the one or more processing devices to inject the emulated memory error by causing the one or more processing devices to inject an interrupt that is accepted by a virtual central processing unit (vCPU) of the virtual machine associated with the corrupted memory element.

Further, the emulated memory error may comprise a notification that causes the virtual machine associated with the corrupted memory element to signal the uncorrectable memory error to a guest user space. The emulated memory error may comprise a notification that causes the virtual machine associated with the corrupted memory element to be restarted or terminated. In addition, a BIOS of the host machine may be configured to forward information associated with the uncorrectable memory error to an operating system of the host machine. The operating system of the host machine may forward the information associated with the uncorrectable memory error to the one or more processing devices.

Further aspects of the system may comprise the emulated memory error containing context information associated with the uncorrectable memory error including one or more of a location, a type, or a severity. Furthermore, the operating system of the host machine may forward the information associated with the uncorrectable memory error to the one or more processing devices. The one or more processing devices may comprise a hypervisor. In addition, the instructions may provide that determining the virtual machine associated with the corrupted memory element comprises identifying at least one memory page associated with the corrupted memory element.

Additional aspects of the disclosed technology may comprise one or more non-transitory computer readable media having stored thereon instructions that cause one or more processing devices to perform a process or method for uncorrectable memory error recovery in a cloud computing environment comprising receiving, at a virtual machine managing manager, signaling of an uncorrectable memory error detected by a host machine; determining, at the virtual machine managing manager, one or more virtual machines associated with a corrupted memory element based on the received signaling; emulating a memory error associated with the corrupted memory element based on the uncorrectable memory error; and introducing, by the virtual machine managing manager, the emulated memory error into an operating environment of at least one of the one or more virtual machines. The instructions may comprise one or more other method or process steps of the disclosed technology.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustratively depicts a block diagram of an example system or environment in accordance with aspects of the disclosed technology.

FIG. 2 illustratively depicts a block diagram of an example system or environment in accordance with aspects of the disclosed technology.

FIG. 3 illustratively depicts a flow or swim diagram of an example process or method in accordance with aspects of the disclosed technology.

FIG. 4 depicts a flow diagram of an example process or method in accordance with aspects of the disclosed technology.

DETAILED DESCRIPTION

Overview

Memory errors are generally classified as correctable and uncorrectable. Correctable errors typically do not affect normal operation of a host machine, and thus a host computing system, in a cloud environment. Uncorrectable errors are typically fatal to the entire host computing system, causing, for example, the host machine to crash or shut down. In a cloud based virtual machine environment, this implies that all virtual machines (VMs) supported by a host machine will crash or shut down with the host, leaving the VMs/user(s) with no clue as to the cause and little chance of recovery. The impact of uncorrectable memory errors in a modern cloud computing system is typically significant, as these systems oftentimes employ relatively large sized memories per host, e.g., a cloud computing engine may enable a single VM with as much as 12 terabytes of memory. These larger hosts typically experience a higher rate of uncorrectable memory errors than smaller hosts, e.g., more memory translates into more memory errors. Downtime due to memory errors is typically very costly.

An aspect of the disclosed technology comprises a cloud computing infrastructure that allows a host and its associated VM(s) to stay up and/or recover from memory errors, including uncorrectable memory errors, as well as localize and contain the memory errors so that they do not impact other parts of the system, such as guest VM(s) workloads. For instance, the disclosed technology comprises configuring a host machine BIOS (including the associated memory elements) to enable error signaling recoverable at an operating system (OS), enhancing and enabling the OS's recovery path upon detection of memory errors on memory pages. An example of the disclosed technology comprises a central processing unit (CPU) capability that can signal an operating system (OS) with context information associated with memory errors (e.g., address, severity, whether signaled in isolation such that the error is recoverable, etc.). Such a mechanism may, for example, comprise Intel's x86 machine check architecture, in which the CPU reports hardware errors to the OS. A machine check exception (MCE) handler in the OS's kernel, such as provided via Linux for example, may then use an application programming interface (API) such as POSIX to signal the user space software application that triggered the MCE of the error's existence. The user space application may be made MCE resilient by providing data redundancy (e.g., a tiered caching model such as persistent storage then memory); native sharding of a local working set of data to minimize the amount of lost work; or making the user application as stateless as practicable such that it can restart without losing work.

An aspect of the disclosed technology comprises a cloud computing systemor architecture in which a mechanism is provided so that a virtualmachine manager or hypervisor includes a capability to be alerted by ahost machine of memory errors, particularly uncorrectable memory errors.The hypervisor, upon being alerted, processes the memory errorinformation it receives from the host machine to determine VMs that maybe accessing the corrupted memory element identifiable from the memoryerror information included in the alert. The hypervisor, uponidentifying affected VMs, notifies them via their respective guest OSsof the memory error by, for example, providing a memory page that flagsvirtual memory locations or logical addresses that are mapped to thephysical memory location or physical addresses of the corrupted memoryelement. The guest OSs and VMs may then avoid using the affected logicaladdresses by moving the instance(s) running on the VMs to other VMs,including the user shutting down those VMs and issuing a command/requestfor replacement VMs. Further, the hypervisor may now avoid thatcorrupted memory element for new VMs requests that may be assigned tothe affected host machine. In addition, for uncorrectable memory errors,the hypervisor may initiate processes to failover VMs running on theaffected host machine so that the host machine may ultimately berepaired.
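By way of illustration only, the lookup from a corrupted physical address to the affected VMs and their logical addresses might be sketched as follows. The `PageMap` class, its method names, the 4 KiB page size, and the sample VM identifiers are all hypothetical and are assumed here purely for the example; the disclosure does not prescribe any particular data structure.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096  # assumption: 4 KiB pages


@dataclass(frozen=True)
class Mapping:
    vm_id: str       # hypothetical VM identifier
    guest_page: int  # guest page number backed by the host page


class PageMap:
    """Hypothetical reverse map from host physical pages to guest pages."""

    def __init__(self):
        self._by_host_page = {}

    def map_page(self, host_page, vm_id, guest_page):
        # Record that guest_page of vm_id is backed by host_page.
        self._by_host_page.setdefault(host_page, []).append(
            Mapping(vm_id, guest_page))

    def affected(self, host_phys_addr):
        # Return the mappings backed by the page containing host_phys_addr.
        return list(self._by_host_page.get(host_phys_addr // PAGE_SIZE, []))


page_map = PageMap()
page_map.map_page(host_page=0x1234, vm_id="vm-a", guest_page=0x10)
page_map.map_page(host_page=0x1235, vm_id="vm-b", guest_page=0x20)

# An error reported at offset 0x80 inside host page 0x1234.
bad_addr = 0x1234 * PAGE_SIZE + 0x80
hits = page_map.affected(bad_addr)

# Translate to the logical (guest) addresses that would be flagged.
flagged = [(m.vm_id, m.guest_page * PAGE_SIZE + bad_addr % PAGE_SIZE)
           for m in hits]
```

In this sketch only `vm-a` is returned for the faulty address, so only that VM would need to be notified; `vm-b` is backed by a different host page and is left alone.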

As may be appreciated, a cloud computing system or architecture implemented in accordance with the foregoing mechanism can contain and allow for graceful recovery from uncorrectable memory errors. Specifically, by identifying the affected memory, a hypervisor can limit or eliminate use (e.g., reads or accesses) of such memory prospectively. In addition, the hypervisor can limit the impact to only the affected VM. In addition, the hypervisor may initiate failover of the affected VM, and then manage moving unaffected VMs supported by the corrupted host to another host, to allow the corrupted host to be repaired. In this way, a customer's or user's exposure to the impact of uncorrectable memory errors may be limited to only affected VMs whose virtual memory is linked to the corrupted physical memory element or address, while unassociated VMs are kept unaware of the error and suffer no impact from it.

For instance, in applications where re-starting the application or VM is not enough, e.g., large database applications, corrupted data may be isolated and swapped out. In webservice type applications, aspects of the disclosed technology allow all the unaffected VMs to stay alive and keep running, while only the affected VM is restarted. In GPU or TPU clusters, a similar advantage may be realized, since only VMs or the application affected by the memory error need to be restarted.

When an uncorrectable error is signaled, normally permanent data corruption has already happened in memory. Under certain circumstances, it may still be possible to recover from even those errors. For instance, recovery may be possible if the corrupted cache line/page satisfies certain conditions. Specifically, recovery may be possible where the content of the affected memory page can be reconstructed from persistent storage, e.g., a SAP HANA in-memory database. Recovery may also be possible if the affected process is a non-critical user space process, in which case it may be acceptable to poison and discard the corrupted page and restart the process without affecting the main jobs on the VM. As another example, recovery may be possible where the main serving jobs on the VM can survive restarts. This may be true for many use cases. In that regard, users usually prefer restarts to abruptly losing the VM host, e.g., in a distributed training workload, a worker restart with training continuing from the last saved checkpoint is less intrusive than losing a machine entirely for an extended period of time, thereby rendering the entire pod used for the training workload unusable. On the other hand, in instances where recovery is not possible for the affected VM, keeping the rest of the VMs on the host alive reduces the blast radius of the memory error significantly. Aspects of the disclosed technology may make recovery and containment from uncorrectable memory errors feasible.
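The recovery conditions described above can be read as a simple decision procedure. The following is a minimal sketch under stated assumptions: the `PageContext` fields, the `Action` names, and the order in which the conditions are checked are hypothetical choices made for the example and are not taken from the disclosure.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    RELOAD_PAGE = "reconstruct page from persistent storage"
    RESTART_PROCESS = "poison and discard page, restart process"
    RESTART_VM = "restart affected VM"
    CONTAIN_ONLY = "no recovery for affected VM, keep other VMs alive"


@dataclass
class PageContext:
    backed_by_storage: bool    # content can be rebuilt from persistent storage
    owner_is_critical: bool    # page belongs to a main job on the VM
    vm_survives_restart: bool  # main serving jobs tolerate a restart


def choose_recovery(ctx: PageContext) -> Action:
    # Least disruptive option first (assumed ordering).
    if ctx.backed_by_storage:
        return Action.RELOAD_PAGE
    if not ctx.owner_is_critical:
        return Action.RESTART_PROCESS
    if ctx.vm_survives_restart:
        return Action.RESTART_VM
    # Recovery is not possible; containment still limits the blast radius.
    return Action.CONTAIN_ONLY


# Example: a training worker that can resume from its last checkpoint.
worker = PageContext(backed_by_storage=False,
                     owner_is_critical=True,
                     vm_survives_restart=True)
decision = choose_recovery(worker)
```

For the checkpointed training worker in the example, the sketch selects a restart of the affected VM rather than loss of the whole host.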

Example Systems

FIG. 1 is an example system 100 in accordance with aspects of the disclosure. System 100 includes one or more computing devices 110, which may comprise computing devices 110₁ through 110ₖ, a network 140 and one or more cloud computing systems 150, which may comprise cloud computing systems 150₁ through 150ₘ. Computing devices 110 may comprise computing devices located at customer locations that make use of cloud computing services such as Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and/or Software as a Service (SaaS). For example, if a computing device 110 is located at a business enterprise, computing device 110 may use cloud systems 150 as a service that provides software applications (e.g., accounting, word processing, inventory tracking, etc. applications) to computing devices 110 used in operating enterprise systems. As an alternative example, computing device 110 may lease infrastructure in the form of virtual machines on which software applications are run to support enterprise operations.

As shown in FIG. 1, each of computing devices 110 may include one or more processors 112, memory 116 storing data (D) and instructions (I), display 120, communication interface 124, and input system 128, which are shown as interconnected via network 130. Computing device 110 may also be coupled or connected to storage 136, which may comprise local or remote storage, e.g., on a Storage Area Network (SAN), that stores data accumulated as part of a customer's operation. Computing device 110 may comprise a standalone computer (e.g., desktop or laptop) or a server associated with a customer. A given customer may also implement as part of its business multiple computing devices as servers. If a standalone computer, network 130 may comprise data buses, etc., internal to a computer; if a server, network 130 may comprise one or more of a local area network, virtual private network, wide area network, or other types of networks described below in relation to network 140. Memory 116 stores information accessible by the one or more processors 112, including instructions 132 and data 134 that may be executed or otherwise used by the processor(s) 112. The memory 116 may be of any type capable of storing information accessible by the processor, including a computing device-readable medium, or other medium that stores data that may be read with the aid of an electronic device, such as a hard-drive, memory card, ROM, RAM, DVD or other optical disks, as well as other write-capable and read-only memories. Systems and methods may include different combinations of the foregoing, whereby different portions of the instructions and data are stored on different types of media.

The instructions 132 may be any set of instructions to be executed directly (such as machine code) or indirectly (such as scripts) by the processor. For example, the instructions may be stored as computing device code on the computing device-readable medium. In that regard, the terms “instructions” and “programs” may be used interchangeably herein. The instructions may be stored in object code format for direct processing by the processor, or in any other computing device language including scripts or collections of independent source code modules that are interpreted on demand or compiled in advance. Processes, functions, methods, and routines of the instructions are explained in more detail below.

The data 134 may be retrieved, stored, or modified by processor 112 in accordance with the instructions 132. As an example, data 134 associated with memory 116 may comprise data used in supporting services for one or more client devices, an application, etc. Such data may include data to support hosting web-based applications, file share services, communication services, gaming, sharing video or audio files, or any other network based services.

The one or more processors 112 may be any conventional processor, such as commercially available CPUs. Alternatively, the one or more processors may be a dedicated device such as an ASIC or other hardware-based processor. Although FIG. 1 functionally illustrates the processor, memory, and other elements of computing device 110 as being within a single block, it will be understood by those of ordinary skill in the art that the processor, computing device, or memory may actually include multiple processors, computing devices, or memories that may or may not be located or stored within the same physical housing. In one example, one or more computing devices 110 may include one or more server computing devices having a plurality of computing devices, e.g., a load balanced server farm, that exchange information with different nodes of a network for the purpose of receiving, processing, and transmitting the data to and from other computing devices as part of a customer's business operation.

Computing device 110 may also include a display 120 (e.g., a monitor having a screen, a touch-screen, a projector, a television, or other device that is operable to display information) that provides a user interface that allows for controlling the computing device 110 and accessing user space applications and/or data associated with VMs supported in one or more cloud systems 150, e.g., on a host in a cloud system 150. Such control may include, for example, using a computing device to cause data to be uploaded through input system 128 to cloud system 150 for processing, cause accumulation of data on storage 136, or more generally, manage different aspects of a customer's computing system. In some examples, computing device 110 may also access an API that allows it to specify workloads or jobs that run on VMs in the cloud as part of IaaS or SaaS. While input system 128, e.g., a USB port, may be used to upload data, computing device 110 may also include a mouse, keyboard, touchscreen, or microphone that can be used to receive commands and/or data.

The network 140 may include various configurations and protocols including short range communication protocols such as Bluetooth™, Bluetooth™ LE, the Internet, World Wide Web, intranets, virtual private networks, wide area networks, local networks, private networks using communication protocols proprietary to one or more companies, Ethernet, WiFi, HTTP, etc. and various combinations of the foregoing. Such communication may be facilitated by any device capable of transmitting data to and from other computing devices, such as modems and wireless interfaces. Computing device 110 interfaces with network 140 through communication interface 124, which may include the hardware, drivers, and software necessary to support a given communications protocol.

Cloud computing systems 150 may comprise one or more data centers that may be linked via high speed communications or computing networks. A given data center within system 150 may comprise dedicated space within a building that houses computing systems and their associated components, e.g., storage systems and communication systems. Typically, a data center will include racks of communication equipment, servers/hosts, and disks. The servers/hosts and disks comprise physical computing resources that are used to provide virtual computing resources such as VMs. To the extent a given cloud computing system includes more than one data center, those data centers may be at different geographic locations within relatively close proximity to each other, chosen to deliver services in a timely and economically efficient manner, as well as provide redundancy and maintain high availability. Similarly, different cloud computing systems are typically provided at different geographic locations.

As shown in FIG. 1, computing system 150 may be illustrated as comprising host machines 152, storage 154, and infrastructure 160. Host machines 152, storage 154, and infrastructure 160 may comprise a data center within a cloud computing system 150. Infrastructure 160 may comprise switches, physical links (e.g., fiber), and other equipment used to interconnect host machines within a data center with storage 154. Storage 154 may comprise a disk or other storage device that is partitionable to provide physical or virtual storage to virtual machines running on processing devices within a data center. Storage 154 may be provided as a SAN within the data center hosting the virtual machines supported by storage 154 or in a different data center that does not share a physical location with the virtual machines it supports. One or more hosts or other computer systems within a given data center may be configured to act as a supervisory agent or hypervisor in creating and managing virtual machines associated with one or more host machines in a given data center. In general, a host or computer system configured to function as a hypervisor will contain the instructions necessary to, for example, manage the operations that result from providing IaaS, PaaS, or SaaS to customers or users as a result of requests for services originating at, for example, computing devices 110.

In the example shown in FIG. 2, a distributed system 200, such as that shown in relation to cloud systems 150 of FIG. 1, includes a collection 204 of host machines 210 (e.g., hardware resources 210) supporting or executing the virtual computing environment 300. The virtual computing environment 300 includes a virtual machine manager (VMM) 320 and a virtual machine (VM) layer 340 running one or more virtual machines (VMs) 350a-n configured to execute instances 362a, 362a-n of one or more software applications 360. Each host machine 210 may include one or more physical central processing units (pCPU) 212 (“data processing hardware 212”) and associated memory hardware 216. While each hardware resource or host 210 is shown having a single physical processor 212, any hardware resource 210 may include multiple physical processors 212. Hosts 210 also include physical memory 216, which may be partitioned by host operating system (OS) 220 into virtual memory and assigned for use by VMs 350 in the VM layer 340, or even the VMM 320 or host OS 220. Physical memory 216 may comprise random access memory (RAM) and/or disk storage (including storage 154 accessible via infrastructure 160 as shown in FIG. 1).

Host operating system (OS) 220 may execute on a given one of the host machines 210 or may be configured to operate across a collection, including a plurality, of the host machines 210. For convenience, FIG. 2 shows the host OS 220 as operating across the collection of machines 210₁ through 210ₘ. Further, while the host OS 220 is illustrated as being part of the virtual computing environment 300, each host machine 210 is equipped with its own OS 218. However, from the perspective of a virtual environment, the OS on each machine appears as and is managed as a collective OS 220 to a VMM 320 and VM layer 340.

In some examples, the VMM 320 corresponds to a hypervisor 320 (e.g., a Compute Engine) that includes at least one of software, firmware, or hardware configured to create, instantiate/deploy, and execute the VMs 350. A computer, such as data processing hardware 212, associated with the VMM 320 that executes the one or more VMs 350 is typically referred to as a host machine 210 (as used above), while each VM 350 may be referred to as a guest machine. Here, the VMM 320 or hypervisor is configured to provide each VM 350 a corresponding guest operating system (OS) 354, e.g., 354a-n, having a virtual operating platform and manages execution of the corresponding guest OS 354 on the VM 350. As used herein, each VM 350 may be referred to as an “instance” or a “VM instance.” In some examples, multiple instances of a variety of operating systems may share virtualized resources. For instance, a first VM 350 of the Linux® operating system, a second VM 350 of the Windows® operating system, and a third VM 350 of the OS X® operating system may all run on a single physical x86 machine.

The VM layer 340 includes one or more virtual machines 350. The distributed system 200 enables a user (through one or more computing devices 110) to launch VMs 350 on demand, i.e., by sending a command or request 170 (FIG. 1) to the distributed system 200 (comprising a cloud system 150) via the network 140. For instance, the command/request 170 may include an image or snapshot associated with the corresponding operating system 220 and the distributed system 200 may use the image or snapshot to create a root resource 210 for the corresponding VM 350. Here, the image or snapshot within the command/request 170 may include a boot loader, the corresponding operating system 220, and a root file system. In response to receiving the command/request 170, the distributed system 200 may instantiate the corresponding VM 350 and automatically start the VM 350 upon instantiation.

A VM 350 emulates a real computer system (e.g., host machine 210) and operates based on the computer architecture and functions of the real computer system or a hypothetical computer system, which may involve specialized hardware, software, or a combination thereof. In some examples, the distributed system 200 authorizes and authenticates the user device 110 before launching the one or more VMs 350. An instance 362 of a software application 360, or simply an instance, refers to a VM 350 hosted on (executing on) the data processing hardware 212 of the distributed system 200.

The host OS 220 virtualizes underlying host machine hardware and manages concurrent execution of one or more VM instances 350. For instance, host OS 220 may manage VM instances 350a-n and each VM instance 350 may include a simulated version of the underlying host machine hardware, or a different computer architecture. The simulated version of the hardware associated with each VM instance 350, 350a-n is referred to as virtual hardware 352, 352a-n. The virtual hardware 352 may include one or more virtual central processing units (vCPUs) (“virtual processor”) emulating one or more physical processors 212 of a host machine 210. The virtual processor may be interchangeably referred to as a “computing resource” associated with the VM instance 350. The computing resource may include a target computing resource level required for executing the corresponding individual service instance 362.

The virtual hardware 352 may further include virtual memory in communication with the virtual processor and storing guest instructions (e.g., guest software) executable by the virtual processor for performing operations. For instance, the virtual processor may execute instructions from the virtual memory that cause the virtual processor to execute a corresponding individual service instance 362 of the software application 360. Here, the individual service instance 362 may be referred to as a guest instance that cannot determine if it is being executed by the virtual hardware 352 or the physical data processing hardware 212. A host machine's microprocessor(s) can include processor-level mechanisms to enable virtual hardware 352 to execute software instances 362 of applications 360 efficiently by allowing guest software instructions to be executed directly on the host machine's microprocessor without requiring code-rewriting, recompilation, or instruction emulation. The virtual memory may be interchangeably referred to as a “memory resource” associated with the VM instance 350. The memory resource may include a target memory resource level required for executing the corresponding individual service instance 362.

The virtual hardware 352 may further include at least one virtual storage device that provides run time capacity for the service on the physical memory hardware 216. The at least one virtual storage device may be referred to as a storage resource associated with the VM instance 350. The storage resource may include a target storage resource level required for executing the corresponding individual service instance 362. The guest software executing on each VM instance 350 may further assign network boundaries (e.g., allocate network addresses) through which respective guest software can communicate with other processes reachable through an internal network 160 (FIG. 1), the external network 140 (FIG. 1), or both. The network boundaries may be referred to as a network resource associated with the VM instance 350.

The guest OS 354 executing on each VM 350 includes software that controls the execution of the corresponding individual service instance 362, e.g., one or more of 362a-n of the application 360 by the VM instance 350. The guest OS 354, 354a-n executing on a VM instance 350, 350a-n can be the same as or different from the other guest OS 354 executing on the other VM instances 350. In some implementations, a VM instance 350 does not require a guest OS 354 in order to execute the individual service instance 362. The host OS 220 may further include virtual memory reserved for a kernel 226 of the host OS 220. The kernel 226 may include kernel extensions and device drivers, and may perform certain privileged operations that are off limits to processes running in a user process space of the host OS 220. Examples of privileged operations include access to different address spaces, access to special functional processor units in the host machine 210 such as memory management units, and so on. A communication process 224 running on the host OS 220 may provide a portion of VM network communication functionality and may execute in the user process space or a kernel process space associated with the kernel 226.

In accordance with aspects of the disclosed technology, unrecoverable memory errors, for example bit flips, that occur on a host machine 210 that implements MCE may be managed at the hypervisor layer to mitigate and/or avoid crashes of affected guest VMs and to contain the impact of an unrecoverable memory error to only the affected guest VMs. For example, the BIOS associated with a given host machine 210 is configured so that MCEs generated by pCPU 212 on the host are sent to kernel 226. The MCE includes context information about the error including, for example, the physical memory address, the severity of the error, whether the error is an isolated error, a component within a pCPU where the error was signaled from, etc. Kernel 226 relays the error to the hypervisor 320. Hypervisor 320 then processes that information to identify the virtual memories associated with the error and identifies any affected memory pages, as well as associated VMs. As VMs typically do not share virtual memory, a given memory error may be isolated to a given VM. Therefore, there is little to no risk of propagating the error beyond the affected VM(s). Hypervisor 320 then isolates the corrupted memory page to prevent the guest OS from accessing it. Next, the hypervisor informs the affected guest OS of the error by emulating the error. Specifically, the hypervisor injects an interrupt, e.g., interrupt 80, to the guest OS, which informs the guest OS of the error. In this way, for example, only a VM affected by the error is notified of the error and only that VM or the application associated with that VM may be restarted.
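As an illustration only, the step in which the hypervisor maps a reported physical address to an affected VM and memory page might be sketched as follows. The `GuestRegion` record, the `locate_error` helper, the 4 KiB page size, and the example addresses are all assumptions introduced for this sketch and are not part of the disclosure.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

PAGE_SIZE = 4096  # assumed page size for this sketch


@dataclass(frozen=True)
class GuestRegion:
    """Hypothetical record: a host-physical range that backs guest memory."""
    vm_id: str
    host_start: int   # first host physical address of the range
    guest_start: int  # guest physical address that host_start maps to
    length: int       # size of the range in bytes


def locate_error(regions: List[GuestRegion],
                 host_addr: int) -> Optional[Tuple[str, int]]:
    """Return (vm_id, guest page number) for a host address, or None."""
    for r in regions:
        if r.host_start <= host_addr < r.host_start + r.length:
            guest_addr = r.guest_start + (host_addr - r.host_start)
            return r.vm_id, guest_addr // PAGE_SIZE
    return None  # the address does not back any guest memory


# Made-up layout: two VMs, each backed by one 32 KiB host range.
regions = [
    GuestRegion("vm-a", host_start=0x10000, guest_start=0x0, length=0x8000),
    GuestRegion("vm-b", host_start=0x20000, guest_start=0x0, length=0x8000),
]
hit = locate_error(regions, 0x21234)  # falls inside the range backing vm-b
```

Because each host range in this sketch belongs to exactly one VM, a single lookup identifies the only VM that needs to be notified, which mirrors the isolation argument made above.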

In addition, having been notified of corrupted virtual memory addresses or a memory page containing such addresses, the affected VM may avoid reading from or accessing those memory locations, which results in containment of the error. For example, each memory read or access of a corrupted memory element generates an MCE. An aspect of the disclosed technology mitigates and/or avoids multiple reads or accesses of corrupted memory elements after the error is detected at the host level and the VMM and/or guest OS are notified of the error.
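One way to picture this containment, purely as a sketch, is a guard that refuses reads of pages already reported as corrupted, so that no further access (and hence no further MCE) is triggered. The `GuardedMemory` class and `PoisonedPageError` exception are hypothetical names chosen for this example.

```python
class PoisonedPageError(Exception):
    """Raised instead of touching a page known to be corrupted."""


class GuardedMemory:
    """Hypothetical page store that blocks access to poisoned pages."""

    def __init__(self, pages):
        self._pages = dict(pages)  # page number -> contents
        self._poisoned = set()     # page numbers reported as corrupted

    def mark_poisoned(self, page_no):
        # Called once the error notification for this page has arrived.
        self._poisoned.add(page_no)

    def read(self, page_no):
        # Check the poison set first so the corrupted page is never read.
        if page_no in self._poisoned:
            raise PoisonedPageError(f"page {page_no} is corrupted")
        return self._pages[page_no]


mem = GuardedMemory({0: b"good data", 1: b"corrupted"})
mem.mark_poisoned(1)
```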

In other examples, a user application may be running across multiple virtual machines, and a memory error associated with a single VM may impact multiple VMs (e.g., a machine learning training job). In such examples, the impact of the error may require that more than one VM be notified of the error. For instance, if the hypervisor had distributed a given job or jobs among more than one VM, the hypervisor may then broadcast the error to all affected VMs. In this instance, the user may decide that shutting down and restarting the affected application is the viable option. In contrast, where a single VM is involved, keeping the VM alive by, for example, providing it with a new memory page, or restarting it may be a viable option.
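The broadcast case can be sketched as a lookup of every VM that shares a job with the VM that took the error. The `vms_to_notify` helper and the job-to-VM mapping below are assumptions made for illustration; the disclosure does not specify how job membership is tracked.

```python
def vms_to_notify(affected_vm, job_members):
    """Return the sorted VM ids that should receive the error.

    job_members is a hypothetical mapping of job name -> set of VM ids.
    """
    targets = {affected_vm}  # the VM that owns the corrupted page
    for members in job_members.values():
        if affected_vm in members:
            targets |= members  # every VM sharing a job is also affected
    return sorted(targets)


# Made-up example: one training job spread over three VMs, one single-VM job.
jobs = {
    "training": {"vm-a", "vm-b", "vm-c"},
    "web": {"vm-d"},
}
broadcast = vms_to_notify("vm-b", jobs)
single = vms_to_notify("vm-d", jobs)
```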

Example Processes or Methods

An example of a processing flow or method 370 in accordance with aspects of the disclosed technology is shown in FIG. 3. Host 372 includes a BIOS, CPU, and a kernel (as part of its OS). The host is configured to detect uncorrectable memory errors and issue machine check exceptions (MCE) in response to such detection. In addition, a capability to classify detected uncorrectable memory errors is also provided. For example, the classification may include where the error is discovered, whether it is recoverable or not, and what type of recovery is allowed or necessary. For instance, some hardware architectures relay context information that signals software that recovery is not possible and therefore the kernel needs to enter panic mode. A typical example where that occurs is when execution context is corrupted (e.g., error occurs in the middle of a CPU executing certain instructions). When an uncorrectable memory error is detected in host 372, the BIOS sends an MCE to the CPU, line 376.
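The classification described above can be illustrated with a small decision function. This is a sketch under stated assumptions: the `MceContext` fields and the two `Action` values are hypothetical and do not correspond to any particular hardware register layout.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    PANIC = "panic"                        # recovery is not possible
    FORWARD_TO_HYPERVISOR = "forward"      # recovery may be attempted


@dataclass(frozen=True)
class MceContext:
    """Hypothetical subset of the context information carried by an MCE."""
    phys_addr: int
    recoverable: bool
    context_corrupted: bool  # e.g., error hit mid-instruction


def classify(ctx: MceContext) -> Action:
    # A corrupted execution context or an explicitly unrecoverable error
    # leaves the kernel no option other than entering panic mode.
    if ctx.context_corrupted or not ctx.recoverable:
        return Action.PANIC
    return Action.FORWARD_TO_HYPERVISOR


data_error = MceContext(phys_addr=0x21234, recoverable=True,
                        context_corrupted=False)
exec_error = MceContext(phys_addr=0x21234, recoverable=True,
                        context_corrupted=True)
```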

The CPU then relays the MCE information (depicted as #MC) to the kernel of host 372, line 378. #MC and MCE, or MCE information, may comprise the same context information or the same type of context information. A handler (e.g., MCE or #MC Handler) within the kernel receives the MCE information (#MC) as to the uncorrectable memory event, including context information, and signals (line 382) an MCE signal handler in hypervisor 386. Signaling may occur via a bus error signal (e.g., SIGBUS). Hypervisor 386 decodes the MCE information and maps it to the virtual memory space associated with the VMs supported by the affected host, line 388. In doing so, hypervisor 386 determines the virtual memory and memory page associated with the corrupted memory element. In addition, the hypervisor 386 emulates the MCE event, line 388. That is, the hypervisor 386 translates the context information associated with the physical memory error into context information associated with the virtual memory location. The hypervisor then provides error emulation by injecting the MCE information (#MC injection) into the affected VM environment, line 390. Specifically, the hypervisor initiates an interrupt to the vCPU and provides the #MC injection to the vCPU.
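The emulation step, in which physical error context is rewritten into context meaningful to the guest and then handed to the vCPU, might look roughly like the following. `MceInfo`, `emulate_mce`, and `VCpu` are hypothetical names for this sketch, and the pending list stands in for whatever injection mechanism a real hypervisor uses.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MceInfo:
    """Hypothetical MCE context: address plus a few descriptive fields."""
    address: int
    severity: str
    address_space: str  # "host-physical" or "guest-physical"


def emulate_mce(host_mce: MceInfo, host_base: int, guest_base: int) -> MceInfo:
    """Translate host error context into the guest's address space."""
    offset = host_mce.address - host_base
    if offset < 0:
        raise ValueError("address lies below the mapped host range")
    # Severity and other context are carried over; only the location changes.
    return replace(host_mce, address=guest_base + offset,
                   address_space="guest-physical")


class VCpu:
    """Stand-in for a virtual CPU that accepts injected #MC events."""

    def __init__(self):
        self.pending = []

    def inject(self, mce: MceInfo) -> None:
        self.pending.append(mce)


host_mce = MceInfo(address=0x21234, severity="uncorrected",
                   address_space="host-physical")
guest_mce = emulate_mce(host_mce, host_base=0x20000, guest_base=0x0)
vcpu = VCpu()
vcpu.inject(guest_mce)
```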

The vCPU then forwards the #MC injection to the guest OS running on the VM 391, line 392. A handler within the guest kernel of the guest OS, in response to receipt of the #MC injection for example, issues a bus error signal to the guest user space (e.g., the guest user space application), line 394. Injecting the #MC or MCE information into the guest OS and/or having the #MC Handler in the guest kernel signal the guest user space 395 has the effect of keeping the VM instance alive. The guest user application may then decide to continue running by recovering from the memory error signaled, shutting down the affected VM, or restarting the VM instance, line 398. Possible recovery actions may include remapping and reloading the affected memory page or restarting the guest user space program. For example, if the content of the corrupted memory element can be reconstructed from storage, a new memory page may be remapped and reloaded, and the instance or application may continue to run or be kept alive. Alternatively, if the application can survive a restart, then that course of action may be undertaken. In some circumstances, depending on the error, live migration of the VMs residing on the host may be necessary, but even here this can be handled more gracefully as compared to an abrupt shutdown of the host.
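The guest-side choice among the recovery actions above can be sketched as follows. The `recover` function, the dictionary-based page store, and the returned action strings are assumptions for illustration; a real guest application would act on the bus error signal it receives.

```python
def recover(pages, backing_store, bad_page, survives_restart):
    """Pick a recovery action for a corrupted page (hypothetical helper).

    pages:          dict of page number -> in-memory contents
    backing_store:  dict of page number -> contents recoverable from storage
    """
    if bad_page in backing_store:
        # Contents can be reconstructed: install a fresh copy and keep running.
        pages[bad_page] = backing_store[bad_page]
        return "continue"
    # Contents are lost: drop the page so it is never read again.
    pages.pop(bad_page, None)
    return "restart" if survives_restart else "shut down"


pages = {0: b"code", 1: b"garbled", 2: b"garbled"}
storage = {0: b"code", 1: b"clean copy"}

first = recover(pages, storage, bad_page=1, survives_restart=True)
second = recover(pages, storage, bad_page=2, survives_restart=True)
```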

As indicated from the foregoing, aspects of the disclosed technology include having an MCE handler of a host kernel signal all the relevant MCE details to a virtual machine manager or hypervisor. Within the hypervisor, an MCE SIGBUS handler records memory error events in, for instance, a VmEvents table. The events table may include a field that records the following details: regular VM metadata (e.g., VM id, project id) and MCE details (e.g., DIMM, rank, bank, MCA registers from all relevant banks). Optionally, neighbor information may also be recorded, e.g., which other VMs are on the host, on the same socket, etc. Neighbor information may be important in analyzing potential security attacks, such as for example a Row Hammer attack. In such an example, the disclosed technology may notify the guest user space of all the affected VMs and cause initiation of more graceful failover to another host.
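A record in such an events table could be modeled as shown below. This is only a sketch: the field names, the in-memory list standing in for the VmEvents table, and the sample values are assumptions, since the disclosure lists the kinds of details recorded but not a schema.

```python
from dataclasses import dataclass, field, asdict
from typing import Dict, List


@dataclass
class VmEvent:
    """Hypothetical row of the VmEvents table."""
    vm_id: str                     # regular VM metadata
    project_id: str
    dimm: int                      # MCE details
    rank: int
    bank: int
    mca_registers: Dict[str, int]  # register name -> raw value
    neighbors: List[str] = field(default_factory=list)  # optional


vm_events: List[dict] = []  # stand-in for the events table


def record_mce(event: VmEvent) -> int:
    """Append the event as a plain row and return the table size."""
    vm_events.append(asdict(event))
    return len(vm_events)


count = record_mce(VmEvent(
    vm_id="vm-b", project_id="proj-1",
    dimm=3, rank=0, bank=5,
    mca_registers={"status": 0xBD, "addr": 0x21234},  # made-up values
    neighbors=["vm-a", "vm-c"],
))
```

Keeping the neighbor list alongside each event is what would let a later analysis correlate errors across co-located VMs, for example when investigating a suspected Row Hammer pattern.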

Memory error containment and memory error recovery are enabled in the BIOS, along with IO stop and scream. Error signaling via a specific new MSI/NMI handler is added to the host kernel with the behavior of simply panicking the host. The host kernel is configured to know which address space the MCE error belongs to and whether the process is a VM.

FIG. 4 illustrates a method or process 400 in accordance with aspects of the disclosed technology. As shown, the method includes detecting and forwarding MCE relating to uncorrectable memory errors to a virtual machine manager or hypervisor, block 410. The MCE information is decoded and mapped by the virtual machine manager or hypervisor to affected memory pages, and thus to the affected VM, block 420. The virtual machine manager or hypervisor then notifies the guest OS, which in turn notifies the guest space, block 430. At the guest user space, it may be determined to keep alive the affected VM instance or application, terminate the application, or restart the application, block 440. Further details regarding these operations have been described above.

Unless otherwise stated, the foregoing alternative examples are not mutually exclusive, but may be implemented in various combinations to achieve unique advantages. As these and other variations and combinations of the features discussed above can be utilized without departing from the subject matter defined by the claims, the foregoing description of the embodiments should be taken by way of illustration rather than by way of limitation of the subject matter defined by the claims. In addition, the provision of the examples described herein, as well as clauses phrased as “such as,” “including,” and the like, should not be interpreted as limiting the subject matter of the claims to the specific examples; rather, the examples are intended to illustrate only one of many possible embodiments. Further, the same reference numbers in different drawings can identify the same or similar elements.

The invention claimed is:
 1. A method for uncorrectable memory error recovery in a cloud computing environment, comprising: receiving, at a virtual machine managing manager, signaling of an uncorrectable memory error detected by a host machine; determining, at the virtual machine managing manager, one or more virtual machines associated with a corrupted memory element based on the received signaling; emulating a memory error associated with the corrupted memory element based on the uncorrectable memory error; and introducing, by the virtual machine managing manager, the emulated memory error into an operating environment of at least one of the one or more virtual machines, and wherein the emulated memory error comprises a notification that causes the at least one of the one or more virtual machines to signal the uncorrectable memory error to a guest user space.
 2. The method of claim 1, wherein the introducing the emulated memory error comprises the virtual machine managing manager injecting an interrupt that is accepted by a virtual central processing unit (vCPU) of each of the one or more virtual machines.
 3. The method of claim 1, wherein the emulated memory error comprises a notification that causes the at least one of the one or more virtual machines to be restarted or terminated.
 4. The method of claim 1, wherein signaling comprises a BIOS of the host machine forwarding information associated with the uncorrectable memory error to an operating system of the host machine.
 5. The method of claim 4, comprising the operating system of the host machine forwarding the information associated with the uncorrectable memory error to the virtual machine manager.
 6. The method of claim 5, wherein introducing comprises the virtual machine manager injecting the emulated memory error into a process of a virtual central processing unit of the at least one virtual machine.
 7. The method of claim 1, wherein the emulated memory error comprises context information associated with the uncorrectable memory error including one or more of a location, a type, or a severity.
 8. The method of claim 1, wherein the virtual machine managing manager comprises a hypervisor.
 9. The method of claim 1, wherein determining the one or more virtual machines associated with the corrupted memory element comprises identifying at least one memory page associated with the corrupted memory element.
 10. The method of claim 1, wherein the emulated memory error comprises a notification that causes a guest operating system of the at least one of the one or more virtual machines to signal the uncorrectable memory error to a guest user space.
 11. A cloud computing system, comprising: a host machine capable of supporting one or more virtual machines; and one or more processing devices coupled to a memory containing instructions, the instructions causing the one or more processing devices to: receive signaling from the host machine, the signaling indicating an uncorrectable memory error; determine, from among the one or more virtual machines, a virtual machine associated with a corrupted memory element based on the received signaling; emulate a memory error associated with the corrupted memory element based on the uncorrectable memory error; and inject the emulated memory error into an operating environment of the virtual machine associated with the corrupted memory element, and wherein the emulated memory error comprises a notification that causes the virtual machine associated with the corrupted memory element to signal the uncorrectable memory error to a guest user space.
 12. The cloud computing system of claim 11, wherein the instructions causing the one or more processing devices to inject the emulated memory error comprises causing the one or more processing devices to inject an interrupt that is accepted by a virtual central processing unit (vCPU) of the virtual machine associated with the corrupted memory element.
 13. The cloud computing system of claim 11, wherein the emulated memory error comprises a notification that causes the virtual machine associated with the corrupted memory element to be restarted or terminated.
 14. The cloud computing system of claim 11, wherein a BIOS of the host machine is configured to forward information associated with the uncorrectable memory error to an operating system of the host machine.
 15. The cloud computing system of claim 14, comprising the operating system of the host machine forwarding the information associated with the uncorrectable memory error to the one or more processing devices.
 16. The cloud computing system of claim 11, wherein the emulated memory error comprises context information associated with the uncorrectable memory error including one or more of a location, a type, or a severity.
 17. The cloud computing system of claim 11, wherein the one or more processing devices comprises a hypervisor.
 18. The cloud computing system of claim 11, wherein the instructions causing the one or more processing devices to determine comprises identifying at least one memory page associated with the corrupted memory element.